Fault-tolerant networks

ABSTRACT

Recovery systems and methods for sustaining the operation of a plurality of networked computers (20a, 20b) in the event of a fault condition are described. The basic recovery system comprises a plurality of virtual machines (31a, 31b) installed on a recovery computer (30), each virtual machine being arranged to emulate a corresponding networked computer, and the recovery computer being arranged, in the event of a detected failure of one of the networked computers, to activate and use the virtual machine which corresponds to the failed networked computer (20). The recovery computer (30) may be located on the same network (12) as the networked computers (20), or alternatively on a remotely located local network in case of failure of the entire local network (12).

TECHNICAL FIELD OF THE INVENTION

[0001] The present invention concerns improvements relating to fault-tolerant networks and particularly, although not exclusively, to various methods, systems and apparatus for the back-up of critical data by mirroring multiple critical computers on corresponding virtual machines which are installed on a single physical back-up computer.

BACKGROUND TO THE INVENTION

[0002] Enterprises around the world require ways to protect against the interruption of their business activities which may occur due to events such as fires, natural disasters, or simply the failure of server computers or workstations that hold business-critical data. As data and information may be a company's most important asset, it is vital that systems are in place that enable a business to carry on its activities such that the loss of income during system downtime is minimised, and to prevent dissatisfied customers from taking their business elsewhere.

[0003] To achieve business continuity, it is necessary for such a system to be tolerant of software and hardware problems and faults. This is normally achieved by having redundant computers and mass storage devices such that a backup computer or disk drive is immediately available to take over in the event of a fault. Such a technique is described in Ohran et al., International patent application WO 95/03580. This document describes a fault-tolerant computer system that provides rapid recovery from a network file server failure through the use of a backup mass-storage device. There are, however, a number of reasons why the techniques used by Ohran and others may be undesirable.

[0004] As can be seen from Ohran, each server requiring a fault-tolerant mode of operation must be backed up by a near-duplicate hardware and software architecture. Such one-for-one duplication may make it infeasible and uneconomic to run a redundant network file server instead of a normal network file server. Further, the need for a redundant network to be continuously on-line to ensure that it is updated at the same time as the normal network server renders its use as an off-line facility for testing infeasible. In addition to this, the majority of redundant networks are unable to provide protection against application failure and data corruption because they either use a common data source (e.g. the same disks) or they use live data. Also, the majority of redundant networks are unable to provide for test and maintenance access without the risk of either network downtime or loss of resilience of the network.

[0005] The present invention aims to overcome at least some of the problems described above.

SUMMARY OF THE INVENTION

[0006] According to a first aspect of the invention, there is provided a recovery system for sustaining the operation of a plurality of networked computers in the event of a fault condition, the system comprising a plurality of virtual machines installed on a recovery computer, each individual virtual machine being arranged to emulate a corresponding networked computer, and the recovery computer being arranged, in the event of failure of one of the networked computers, to activate and use the virtual machine which corresponds to the failed network computer in place of the failed network computer.

[0007] The advantage of this aspect of the invention is that it allows multiple physical computers (or “protected” computers) to be emulated on a single hardware platform, obviating the need for one-to-one copying of each networked computer with a separate physical recovery (or “catcher”) computer.

[0008] A further advantage of the invention is that multiple computers can fail over to one single catcher computer, and this reduces the amount of hardware required to implement the back-up/recovery system. This has the effect of reducing costs. In addition to this, the cost of using software to install the virtual machines and associated software on the catcher computer is less than the costs of hardware that would be required to produce an equivalent physical back-up system. And, of course, the less hardware in a system, the lower the maintenance costs will be. In addition to this, managing software implemented virtual machines is easier than managing physical hardware which again leads to a cost reduction.

[0009] A yet further advantage of the invention is that an entire computing environment may be captured in software, and this provides room for the back-up/recovery system to be scaled-up easily and quickly without the need for additional expensive hardware. In addition, one can easily add additional protected computers that can be protected by the same catcher computer.

[0010] Each virtual machine (e.g., software that mimics the performance of a hardware device) is advantageously configured with its own operating system and network identity. In addition, each virtual machine within the physical catcher computer is advantageously isolated so that entirely separate computing environments can be hosted on a single physical server.

[0011] Preferably, the term computer includes a computer system.

[0012] Preferably the networked computers are physical server computers. Alternatively, the networked computers may be physical workstations such as personal computers, or a mixture of servers and workstations. Thus, in addition to servers, workstations are also capable of many-to-one concurrent recovery.

[0013] The servers may be, for example, SQL servers, Web servers, Microsoft Exchange™ servers, Lotus Notes™ servers (or any other application server), file servers, print servers, or any type of server that requires recovery should a failure occur. Most preferably, each protected server computer runs a network operating system such as Windows NT™ or Windows 2000™.

[0014] Preferably the plurality of protected computers and the catcher computer are part of a network such as a local area network or LAN, thereby allowing transmission of information such as commands, data and programs including active applications between the protected servers and the catcher. The LAN may be implemented as an Ethernet, a token ring, an Arcnet or any other network technology, such network technology being known to those skilled in the art. The network may be a simple LAN topology, or a composite network including such bridges, routers and other network devices as may be required.

[0015] Preferably the catcher computer is specified to have sufficient memory and mass storage device capacity, one or more Intel-compatible processors rated PII or greater, a Network Interface Card (such as Ethernet), a Compact Disk interface, a Floppy Disk Drive, serial ports, parallel ports and other components as may be required. In the case of a LAN which includes a large number of protected computers, multiple catchers each including a plurality of virtual machines may be required to provide the recovery function of the invention.

[0016] The catcher preferably has virtual computer software running on it which is programmed to allow logical partitioning of the catcher computer so that disparate operating environments and/or applications etc. (such as Microsoft Exchange™) which normally require a dedicated computer platform may be run on the same platform as other instances of the same or different environments and/or applications etc. The virtual computing software preferably allows multiple concurrent and/or sequential protected server failures to be accommodated and controlled on a single catcher computer. Alternatively, the catcher may be partitioned into a plurality of virtual machines using, for example, hardware such as Cellular Multiprocessor technology which is used in Unisys ES7000 systems, or by any other suitable hardware technology.

[0017] The catcher virtual computing software provides concurrent software emulation of a plurality of protected computers, the number of which is defined by the manufacturer of the virtual computing software and is also subject to limitations imposed by the catcher computer hardware. The maximum number of virtual machines capable of being supported by a catcher is currently sixteen, although this may be increased as the software and/or hardware is developed. However, if the virtual machines are implemented in hardware, the number of virtual machines supported by the catcher will again be determined by that hardware.
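
By way of illustration only, the sizing consequence of the per-catcher limit can be sketched as a short calculation. The figure of sixteen virtual machines is taken from the text; the function name and parameters are hypothetical and not part of the described system.

```python
import math

# Per-catcher limit quoted in the text; it may differ for other software or hardware.
MAX_VMS_PER_CATCHER = 16


def catchers_required(protected_computers: int,
                      vms_per_catcher: int = MAX_VMS_PER_CATCHER) -> int:
    """Return how many catchers are needed if each protected computer
    gets exactly one virtual machine (hypothetical helper)."""
    if protected_computers < 0 or vms_per_catcher <= 0:
        raise ValueError("counts must be non-negative and the limit positive")
    # Round up: a partially filled catcher is still a whole catcher.
    return math.ceil(protected_computers / vms_per_catcher)
```

Under this assumption a LAN of forty protected computers would need three catchers, since two catchers can host at most thirty-two virtual machines.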

[0018] The protected computers and the catcher computer preferably both have replication software installed thereon which is used in the emulation process. The replication software is preferably capable of operating with minimum resource usage, and with low likelihood of conflict with other applications and hardware.

[0019] The replication software may be programmed to copy selected applications, files etc. from a protected computer to the catcher computer via the network by providing the software with either a script containing instructions or instructions input to a graphical user interface. The replication software may also be instructed to keep the original (i.e. mirrored) copy of the protected server files and applications synchronised with the catcher copy of the files following initial mirroring. This process is known as replication.

[0020] In another embodiment of the invention, the catcher computer is located remotely, rather than being connected to the LAN. This aspect of the invention is known as remote backup. Remote backup is preferably achieved by siting the catcher computer at a location other than that which houses the aforedescribed LAN environment. The backup catcher may be connected to the LAN environment by suitable telecommunications means. The advantage of this is that in the event of a major incident, all applications and data are safely stored in their most recently updated condition and made available for whatever purpose by accessing or running the catcher.

[0021] Alternatively, in a further embodiment of the invention there is provided an expanded recovery system comprising the aforedescribed recovery system and further including additional recovery computers, and a programmable console. This embodiment is advantageous as it provides a back-up and recovery system on a local area network, thereby reducing the cost and complexity of the wide area networking requirement implicit in the extended remote method which is described later.

[0022] The programmable console preferably provides a module for the automatic detection and interpretation of events on a network which may trigger the recovery process, and the storage of such images and commands, and the construction of such scripts and command files as are required to ensure that the optimum set of recovery computers may be delivered in response to any given failure condition. The programmable console preferably also includes a means of selectively switching on the power circuits of the recovery computers.

[0023] The additional recovery computers and the programmable console may be provided on a separate network remote from the local protected network but operably connected thereto, along with an additional catcher computer (the recovery catcher).

[0024] In this aspect of the invention, the aforedescribed recovery system is known as the “protected environment”, and the additional system is known as the “recovery environment”, the combined systems being referred to as a “remote recovery network”.

[0025] The further local network may be implemented as an Ethernet, a token ring, an Arcnet, or any other suitable network technology, such technology being known to those skilled in the art. Preferably, the protected environment and the recovery environment are linked by a telecommunications network (such as kilostream, megastream, T1, T2, leased line, fibre, ISDN, VPN, and using any such devices as appropriate, such as bridges, hubs and routers).

[0026] As with the protected catcher, the recovery catcher preferably has installed thereon multiple virtual machines. These virtual machines are preferably arranged to emulate corresponding multiple protected computers.

[0027] The present invention also extends to a method for sustaining the operation of a plurality of networked computers in the event of a fault condition, the method comprising: emulating a plurality of networked computers on a corresponding plurality of virtual machines installed on a single recovery computer; detecting failure of at least one networked computer; attaching the virtual machine which corresponds to the failed networked computer to the network; and activating and using the virtual machine in place of the failed networked computer.

[0028] The step of emulating the plurality of protected networked computers preferably includes copying (or “mirroring”) files and other data from the protected servers to the corresponding virtual machines. The mirroring process may be carried out as a once-only activity or, alternatively, it may be followed by a replication process.

[0029] Replication preferably occurs by identifying changes on the protected computer(s) and applying the same changes to the copy of the information on the catcher computer so that the protected computers and the virtual machines are substantially synchronised. This has the benefit of the changes in the protected server files being reflected in the catcher copy of those files within an acceptably short time frame, thereby giving added protection to users of the network such that when a failure occurs, data which was present on the protected servers is ultimately available to the users via the catcher computer.
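
The replication step described above (identify changes, then apply the same changes to the catcher copy) may be sketched as follows. This is a minimal illustration under stated assumptions, not the replication software itself: files are modelled as in-memory dictionaries mapping a path to its contents, change detection uses SHA-256 digests, and all function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Fingerprint file contents so that changes can be detected."""
    return hashlib.sha256(data).hexdigest()


def identify_changes(protected: dict, catcher: dict):
    """Compare the protected copy with the catcher copy.

    Returns (changed, deleted): files that are new or differ on the
    protected computer, and paths that no longer exist there.
    """
    changed = {
        path: data
        for path, data in protected.items()
        if path not in catcher or digest(catcher[path]) != digest(data)
    }
    deleted = [path for path in catcher if path not in protected]
    return changed, deleted


def replicate(protected: dict, catcher: dict) -> int:
    """Apply the identified changes to the catcher copy in place and
    return the number of changes applied."""
    changed, deleted = identify_changes(protected, catcher)
    catcher.update(changed)
    for path in deleted:
        del catcher[path]
    return len(changed) + len(deleted)
```

Running `replicate` a second time with no intervening changes applies nothing, which mirrors the idea that only changes, rather than whole files, need to cross the network after initial mirroring.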

[0030] However, replication does not have to be carried out continuously. For example, a network administrator may decide that replication does not have to be carried out at night, when no changes are being made to the protected computers.

[0031] Preferably the step of mirroring files from the protected servers to the virtual machines on the catcher computer may be initiated either manually, or automatically (for example, by replication software or other suitable monitoring program capable of detecting a pre-programmed signal or event).

[0032] The method preferably also comprises the step of setting up or initialising the system so that the recovery function may be provided. The step of setting up the recovery network preferably comprises configuring the catcher computer, creating the virtual machines on the catcher computer, and duplicating the networked protected computers on respective virtual machines.

[0033] Preferably, the step of configuring the catcher computer includes installing an operating system on the catcher computer and configuring the operating system parameters. This step preferably further includes installing virtual computing software on the catcher computer and configuring the virtual computing software parameters. Alternatively, the step of configuring the catcher computer may include configuring the computer hardware if the virtual machines are hardware, rather than software, implemented.

[0034] Preferably, the step of creating the virtual machines on the catcher computer comprises installing substantially identical copies of the operating system installed on each protected computer on each respective virtual machine, and configuring the virtual machine operating system parameters. Most preferably, one virtual machine is created for each protected computer. The creating step may be carried out via a virtual computing software management console or other suitable user interface.

[0035] The step of duplicating the protected computers on the respective virtual machines preferably comprises substantially replicating the environment installed on each protected computer on the corresponding virtual machines on the catcher computer. By environment, it is meant applications, utilities, and any other agents that reside on the protected computer. This may be achieved by copying the protected computer to a virtual machine using the replication software, and modifying such registry and system files as may be required. Alternatively, the duplication process may be carried out by restoring backup tapes and modifying such registry and system files as may be required, repeating for the virtual machine the exact installation sequence undertaken to arrive at the present condition of the protected computer, or via any other suitable method.

[0036] The method advantageously includes the step of installing replication software on each protected computer and virtual machine. This step is important as the replication software not only enables the synchronisation of data between the protected computers and the catcher computer, but may also monitor the network for failures. Alternatively, the monitoring of the network may be undertaken by a separate monitoring program.

[0037] Replication may be initiated either manually or automatically. Most preferably, the replication software is chosen so that normal LAN operations are not interrupted. Preferably, the method further includes the step of creating replication scripts which instruct the mirroring and/or replication process. These replication scripts may contain information such as the network identities of the protected (source) computers and the catcher (target) computer.

[0038] Preferably the failure of at least one of the networked protected computers is detected by the catcher computer, either via a virtual machine or most preferably by a monitoring module. The monitoring module may be located in a separate unit remote from the catcher computer, or on the catcher computer itself. Alternatively, the monitoring module may be part of a virtual machine. Failure may thus be detected by the catcher computer and/or the monitoring module receiving a failure signal. The catcher (or other suitable component) may periodically check the status of the protected servers on the network, in which case the failure signal may be the detection of the absence of a “heartbeat” signal from one or more of the protected computers. The failure signal may be transmitted to the console. If the failure signal is incorporated in a file, then this file may be sent to the console via File Transfer Protocol, or via any other suitable method.
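
The heartbeat check described above can be illustrated with a minimal sketch. It assumes that each protected computer reports a timestamped heartbeat and that a computer is treated as failed once no heartbeat has been seen for longer than a configurable timeout; the class name, method names and timeout value are hypothetical, and the transport that carries the heartbeats is outside the sketch.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each protected computer."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_seen = {}  # computer name -> time of last heartbeat

    def record_heartbeat(self, computer: str, now: float) -> None:
        """Note that a heartbeat from `computer` arrived at time `now`."""
        self.last_seen[computer] = now

    def failed_computers(self, now: float) -> list:
        """Return the computers whose heartbeat has been absent for
        longer than the timeout, in name order."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

Passing the current time in explicitly, rather than reading a clock inside the class, keeps the periodic check deterministic and easy to exercise; a real monitoring module would supply the time from its own polling loop.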

[0039] Failure of a protected computer may be an outright failure of one or more of the system hardware components. Alternatively, it may include removal of the protected computer from the network, either accidentally or intentionally. It may further include the failure of one or more applications (for whatever reason) running on one or more of the protected computers.

[0040] If any such status information or failure signal indicates that one or more of the protected computers or an application running or installed thereon has failed, the catcher computer can activate the information copy held in the virtual machine and can assume the protected computer's identity. This is achieved by a software agent running on the catcher computer preferably automatically activating the copy of the information (e.g. a Microsoft Exchange™ application) the catcher computer has previously collected (either by replication or mirroring) from the failed protected computer. The virtual machine corresponding to the failed protected computer can then advantageously assume the failed protected computer network identity. The failed protected computer can then preferably enter a failed condition, the process being known as “failover”.

[0041] The catcher computer may receive information relating to active application programs running on the protected computers, but it preferably does not run any of the copies of the applications which exist on the virtual machines until a failure is detected. The advantage of this is that during normal operation of the network, the catcher computer is only processing data (or file) changes which enable it to catch information in near real-time from multiple protected computers without significant performance degradation.

[0042] The method may further include the step of the catcher continuing to replicate protected servers that have not failed whilst the protected computer is in a failed state. The benefit of this is that users with computers connected to the network suffer a tolerably brief period of failed computer inactivity before being able to resume with substantially intact information and functionality and LAN performance, that is, with minimum interruption to normal business processing. In addition to this, the change over from the failed protected computer to the virtual machine is also very quick, bringing the same advantages.

[0043] The method may further include the step of repairing or replacing the failed protected computer, if required. Alternatively, if the failure was due to a protected computer being disconnected from the network, the step of reconnecting the protected computer to the network may occur.

[0044] The method may also include restoring information (e.g., Microsoft Exchange™ and accompanying files, and Windows NT™ operating system) which had been held on the failed protected computer to the new/repaired protected server from the catcher computer.

[0045] After the protected computer (which may be either a brand new computer, a repaired computer, or the original computer) has been reconnected to the network, the protected server is preferably resynchronised with the catcher computer, and the virtual machine may then relinquish its identity to its corresponding protected server (this process being known as “failback”). The advantage of failback is that user access to the protected computer, which had previously failed, may be resumed with substantially intact information (e.g., operating systems, applications and data) with full functionality and LAN performance. Another advantage is that failback may be scheduled to take place at a time that is judged convenient, so that users suffer a tolerably brief period of downtime.
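
The ordering of failover, resynchronisation and failback described above can be summarised as a small state model. This is only an illustrative sketch: the class, its attribute names and the rule that failback is refused until resynchronisation has completed are assumptions drawn from the wording of the text, not a description of any particular product.

```python
class ProtectedPair:
    """A protected computer and its corresponding virtual machine,
    which take turns holding one network identity."""

    def __init__(self, identity: str):
        self.identity = identity
        self.holder = "protected"   # which side currently answers as `identity`
        self.synchronised = True    # is the protected computer up to date?

    def failover(self) -> None:
        """The virtual machine assumes the failed computer's identity."""
        if self.holder == "protected":
            self.holder = "virtual_machine"
            # From here on, changes accumulate on the virtual machine only.
            self.synchronised = False

    def resynchronise(self) -> None:
        """The repaired or replaced computer is brought up to date."""
        self.synchronised = True

    def failback(self) -> None:
        """The virtual machine relinquishes the identity."""
        if self.holder != "virtual_machine":
            raise RuntimeError("no failover is in effect")
        if not self.synchronised:
            raise RuntimeError("resynchronise before failback")
        self.holder = "protected"
```

Because failback is a separate, explicit call, it can be deferred to a convenient time after resynchronisation, which corresponds to the scheduling advantage noted in the text.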

[0046] A further advantage of the present invention is that a copy of the protected computer's environment (i.e. the operating system and applications/utilities) and files can exist as an “image”, such that under normal operating conditions the virtual machine is not performing any of the processing activities being performed on the protected computers, thereby reducing the demand on the catcher computer's resources, and enabling multiple protected computers to be replicated effectively in a relatively compact catcher environment. Another advantage of this aspect of the present invention is that it permits recovery of applications running on protected computers such that in the event of an application failure being detected, but where the protected computer hosting the application remains operational, the catcher computer is capable of substantially immediately providing continuity of application services to users. The invention also gives the benefit of providing a way to permit workstation continuity such that in the event of a workstation (such as a PC) failing for any reason, the operating system and applications running on the workstation are substantially replicated on the catcher computer, thereby permitting access to users from other interfaces substantially immediately.

[0047] This aspect of the invention may also be used to enable workstation hot-desking, such that PC operating systems and applications are capable of running on the catcher under normal conditions such that users can access them from disparate workstations and locations.

[0048] The aforementioned embodiments of the invention provide near-immediate recovery from single or multiple concurrent computer and/or application failures within a LAN without the need for multiple redundant servers, thereby allowing near-continuous business processing to take place.

[0049] It is further desired to provide a method of rapid recovery from a major incident such as a fire or similar destructive incident affecting an entire (or part of a) LAN environment. This may be achieved by carrying out the aforedescribed method on a system having a first local network comprising a first protected catcher and a plurality of protected computers (the protected environment), and a second remotely located network comprising a second recovery catcher and a plurality of recovery computers (the recovery environment). This aspect of the invention provides the ability to replicate remotely the entire protected environment on the recovery environment. It can also enable an entire protected computer to be replicated on a recovery computer upon failure of the protected computer.

[0050] This aspect of the invention preferably includes the step of installing the protected environment as previously described.

[0051] The step of installing the recovery environment preferably comprises: configuring and preparing the recovery catcher; creating and configuring a plurality of virtual machines on the recovery catcher; connecting the plurality of recovery computers to the recovery network; and establishing a network connection between the protected environment and the recovery environment.

[0052] The installation step may also include the step of preparing a programmable console and physically attaching it to the recovery network. However, in another embodiment of the invention, the console may be attached to the protected network rather than the recovery network. The programmable console may perform the following functions: the automatic detection and interpretation of events on a network which may trigger the recovery process; storing images and commands; constructing scripts and command files which may be required to identify the optimum set of recovery computers; and selectively switching on the power circuits of the recovery computers.

[0053] In this aspect of the invention, the emulation step may be carried out by connecting the protected catcher to the recovery catcher by suitable telecommunications means and copying the virtual machines from the protected catcher to the recovery catcher. Alternatively, the protected computers may be configured to additionally synchronise directly with the virtual machines on the recovery catcher. This second method is advantageous in that loss of access to the protected catcher for whatever reason does not impact on the ability to recover applications or data, but it does place an additional communications overhead on the protected computers as a dual-write is required to the protected catcher and the recovery catcher during normal operation.

[0054] Preferably the copying step comprises configuring and activating the replication software installed on the recovery catcher virtual machines and their corresponding protected catcher virtual machines so that files are synchronised between the protected computers and the corresponding recovery catcher virtual machines. This step may further include configuring and activating the replication software installed on the recovery computers themselves, so that they may also be synchronised directly with the protected catcher virtual machines.

[0055] During normal operations, the protected catcher thus preferably operates in local recovery mode maintaining a synchronised copy of the protected computers' information such as applications, files and other data. Updated components of this information may be almost simultaneously transmitted via the local (protected) network to the recovery catcher, thereby maintaining synchronised copies of the information on the protected computers.

[0056] The images residing on the protected catcher, the recovery catcher and elsewhere in the recovery environment can be advantageously synchronised with each of the respective protected computers by specialised means that do not significantly interrupt protected environment operations.

[0057] The method preferably also comprises monitoring the protected network to detect possible failures. The monitoring step is advantageously carried out by the console.

[0058] As in the previously described local recovery method, if a failure is detected, the appropriate virtual machine on the protected catcher preferably substantially immediately adopts the failed protected computer's identity and function. Then, the protected catcher preferably transmits the identity of the failed protected computer to the recovery network. This may be carried out by the protected catcher sending a file using, for example, File Transfer Protocol, or any other protocol that is suitable for transmitting information between the protected and the recovery environments such as email, short message service (SMS), fax, telephone, WAP, and internet.

[0059] The method preferably includes the further step of identifying and rebuilding an appropriate recovery computer which may be used to replace the failed protected computers. The step of identifying and rebuilding the appropriate recovery computer may be carried out automatically.

[0060] Preferably, the identifying step is carried out by a database which may contain pre-installed information regarding the most appropriate recovery computer to replace any failed protected computer. The recovery computers may each have a boot image pre-installed thereon so that upon booting up they are network enabled.
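
One way to picture the identifying step is as a lookup in a table of pre-installed preferences. The sketch below is an assumption-laden illustration: the text does not specify the database schema, so the mapping from each protected computer to an ordered list of candidate recovery computers, and every name used, is hypothetical.

```python
def select_recovery_computer(failed: str, preferences: dict, available: set):
    """Return the most appropriate available recovery computer for the
    failed protected computer, or None if no candidate is free.

    `preferences` maps a protected computer to its candidate recovery
    computers, listed from most to least appropriate.
    """
    for candidate in preferences.get(failed, []):
        if candidate in available:
            return candidate
    return None
```

For example, with `preferences = {"mail": ["r1", "r2"]}` and only `r2` currently available, the function returns `"r2"`; ordering the candidates in advance is what allows the selection to proceed without user intervention.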

[0061] Preferably, the rebuilding step comprises installing an image of the failed protected computer on the (pre)selected recovery computer. This is preferably carried out by rebooting the recovery computer; loading the recovery computer image onto the recovery computer; creating a replication script; restarting the recovery computer; and sending a signal to the console that the recovery computer is on-line and ready for replication. The rebuilding step is advantageously initiated by the console.

[0062] The recovery server is preferably restarted via a back door disk partition. The advantage of this is that it allows the overwriting of system critical files and registry entries without affecting the running operating system.

[0063] The method preferably further comprises the step of physically relocating and attaching the replacement replica computer to the protected network.

[0064] An advantage of the remote recovery aspect of the invention is that in the event of a major interruption affecting the protected environment, the separate and substantially synchronised recovery environment is unaffected and is capable of being made rapidly available to users. A further advantage of this aspect of the invention is that in the event of a failure of single or multiple computers in the protected environment, the recovery computers can be built (either on-line or off-line), detached from the recovery local network, and physically installed on the protected LAN within a significantly shorter timeframe than has previously been possible. The replacement computer can be automatically selected and rebuilt without substantial user intervention, and can be rebuilt on dissimilar hardware.

[0065] The method may further include the step of attaching at least one recovery server to the recovery environment to replace the recovery server that has been removed from the network.

[0066] The above method may also be used to reconstruct the protected environment from the recovery environment should a protected environment failure occur.

[0067] In a yet further aspect of the invention, the aforedescribed remote recovery environment may be used to provide a method of carrying out system management activities such as the testing and installation of upgrades, the method comprising: emulating a plurality of networked protected computers on a corresponding plurality of virtual machines installed on a single recovery computer; building replica recovery computer(s) while maintaining protected computer operation; physically detaching the recovery computer(s) from the recovery environment and attaching them to a test network so that they may be used for system management activities.

[0068] Once testing and/or the other system management activities arecomplete, operational data may be removed from the detached recoverycomputer(s), and the recovery computer(s) reattached to the recoveryenvironment.

[0069] According to a yet further aspect of the invention there isprovided a method of providing a back-up and recovery networkcomprising: emulating a networked computer on a first virtual machineand a second virtual machine installed on a recovery computer, the firstand second virtual machines containing images of the networked computertaken at different time periods so that, in the event of failure of thenetworked computer, the virtual machine representing the mostappropriate time period can be used to replace the failed networkedcomputer.

[0070] Preferably, the emulating step comprises emulating the networkedcomputer on further virtual machines installed on the recovery computer,each further virtual machine containing an image of the networkedcomputer taken at a time different to that of the other virtualmachines.

[0071] This aspect of the invention may be implemented on any of the recovery systems described herein. Preferably the network further includes means for identifying the most appropriate virtual machine having the best “snapshot” of the protected computer environment. Identification of the most appropriate virtual machine may be carried out either manually or automatically.

[0072] This aspect of the invention provides the facility to manage single or multiple time-delayed images (i.e., snapshots) of specified protected computers, such images being capable of being captured, stored, loaded and run either via manual intervention or programmably at specified times and dates on appropriate physical hardware, thereby offering the possibility of recovering from data corruption (known as “rollback”). In this manner, time-delayed copies of protected server and file images can be automatically obtained and held, and are capable of being rapidly adopted or interrogated.
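
By way of illustration only, automatic identification of the most appropriate snapshot for rollback might be sketched as follows. The sketch assumes that the "best" snapshot is the most recent image captured before the time at which corruption is believed to have occurred; that rule, and the names Snapshot, select_rollback_snapshot and the virtual machine labels, are hypothetical and are not taken from the described system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:
    vm_name: str          # hypothetical label for the virtual machine holding the image
    taken_at: datetime    # time at which the image of the protected computer was captured


def select_rollback_snapshot(
    snapshots: Sequence[Snapshot], corrupted_at: datetime
) -> Optional[Snapshot]:
    """Return the most recent snapshot taken strictly before the corruption time."""
    candidates = [s for s in snapshots if s.taken_at < corrupted_at]
    if not candidates:
        return None  # no clean image exists; manual intervention would be needed
    return max(candidates, key=lambda s: s.taken_at)


snapshots = [
    Snapshot("vm-monday", datetime(2002, 3, 4, 2, 0)),
    Snapshot("vm-tuesday", datetime(2002, 3, 5, 2, 0)),
    Snapshot("vm-wednesday", datetime(2002, 3, 6, 2, 0)),
]
chosen = select_rollback_snapshot(snapshots, corrupted_at=datetime(2002, 3, 5, 14, 30))
print(chosen.vm_name if chosen else "no usable snapshot")
```

Other selection rules (for example, preferring an image that has been authenticated as virus-free) could be substituted for the time comparison without changing the overall shape of the sketch.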

[0073] It may also provide means to permit specified copies of protected computer images to be authenticated as being substantially free from contamination by a computer virus, to provide inoculation of a protected network with an anti-virus agent (such as those produced by Symantec, Sophos, McAfee, MessageLabs) so that an authenticated protected computer image can be re-introduced to the network and run, and to selectively permit quarantined transactions to be inoculated and resubmitted to protected applications (known as “virus recovery”).

BRIEF DESCRIPTION OF DRAWINGS

[0074] Presently preferred embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings. In the drawings:

[0075] FIG. 1a is a schematic diagram showing networked protected servers and a catcher server suitable for implementing a method of recovering data (the local recovery method) according to a first embodiment of the present invention;

[0076] FIG. 1b is a schematic diagram of a virtual machine installed on the catcher server of FIG. 1a;

[0077] FIG. 2 is a flow diagram showing the steps involved in setting up and using a recovery network for the recovery of critical data, according to at least the first embodiment of the invention;

[0078] FIG. 3a is a schematic diagram showing the protected network of FIG. 1 in communication with a remote recovery network which is suitable for implementing another method of recovering data (remote recovery), according to second and third embodiments of the invention;

[0079] FIG. 3b is a schematic diagram showing a further network suitable for implementing a further method of recovering data (local recovery), according to the third embodiment of the invention;

[0080] FIG. 3c is a schematic diagram of a typical virtual machine installed on a catcher computer in the recovery network of FIG. 3a;

[0081] FIG. 3d is a schematic representation of a data record that is used to optimise recovery computer selection according to second, third and fourth embodiments of the invention;

[0082] FIG. 3e is a schematic diagram of a further catcher server suitable for use with the second and third embodiments of the invention;

[0083] FIG. 4a is a flow diagram showing the method steps involved in carrying out the second, third and fourth embodiments of the invention;

[0084] FIG. 4b is a flow diagram showing the steps of creating a recovery server according to the second, third and fourth embodiments of the invention;

[0085] FIG. 5 is a schematic diagram of a network suitable for implementing a method of recording snapshots of protected servers (applied method), according to a fifth embodiment of the invention;

[0086] FIG. 6 is a flow diagram showing the steps of the applied method of the fifth embodiment of the invention;

[0087] FIG. 7 is a state diagram illustrating the connectivity sequence of events during the failover and failback processes according to presently preferred embodiments of the invention; and

[0088] FIG. 8 is a flow diagram showing the network identities of the protected server and the catcher during the failover and failback processes according to presently preferred embodiments of the invention.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

[0089] Referring to FIG. 1a, there is now described a networked system 10 a suitable for implementing a method of backing-up and recovering data according to a first embodiment of the present invention. The system 10 a shown includes a first 20 a, a second 20 b and a third 30 computer system, which in this case are server computers. Each server 20 a, 20 b, 30 is connected to a network 12 through an appropriate standard hardware and software interface.

[0090] The first and second computer systems 20 a,20 b represent servers to be protected by the present embodiment of the invention, and are referred to herein as “protected” servers. Each protected server 20 a,20 b is an Intel-based platform running a respective network operating system 22 a,22 b (such as Windows NT™ or Windows 2000™). The protected servers 20 a,20 b host one or more respective application programs 24 a,24 b (such as Microsoft Exchange™ or Lotus Notes™) and files 26 a,26 b, or they may be used as a means of general network file storage. Also each protected server 20 a,20 b includes replication software 36 as will be described later.

[0091] The third computer system 30 is known as the “catcher”, and is specified to have sufficient memory, sufficient mass storage device capacity, one or more Intel-compatible processors rated PII or greater, Network Interface Card (such as Ethernet), Compact Disk interface, Floppy Disk Drive, serial ports, parallel ports and other such components (not shown) as may be required by specific embodiments of the present invention.

[0092] The catcher 30 runs under an operating system 34 that supports virtual computing software 33 (such as GSX or ESX supplied by VMware™). The virtual computing software 33 provides concurrent software emulation of the protected servers 20 a,20 b. The virtual computing software 33 is programmed to eliminate the problems encountered when multiple normally incompatible applications are co-hosted on a single server.

[0093] Each protected server 20 a,20 b is represented as a respective dedicated virtual machine 31 a,31 b by the virtual computing software 33. Each such virtual machine has configurable properties for mass storage, memory, processor(s), ports and other peripheral device interfaces as may be required by the server 20 a,20 b, and as may be supported by the virtual computing software 33. In this manner, each virtual machine 31 a,31 b contains a near-replica of its corresponding protected server's operating system 22 a,22 b, files 26 a,26 b, applications and utilities 24 a,24 b and data.

[0094] FIG. 1b shows a protected server image for the first protected server 20 a residing as a virtual machine 31 a created by the virtual computing software 33. The protected server image includes an operating system 38, applications 35 a, and replication software 36. The properties of the virtual machine 31 a are configured via a proprietary virtual computing software user interface 35 (such as that supplied as part of VMware's GSX product) in order to map the protected server 20 a onto the available physical properties managed by the operating system 34 running on the catcher 30. The properties of the other virtual machine 31 b are similarly accessible, and may also be altered via the virtual computing software user interface 35 running on the catcher 30.

[0095] The function of the replication software 36 (such as DoubleTake supplied by NSI) is now described. The replication software 36 is programmed to run a customisable batch command file 37 (also called a “replication script”) which resides on the catcher virtual machine 31 a. When run, the batch command file 37 is capable of delivering commands to the operating system 38 and applications 35 a within a specified virtual machine 31 a (such as suspending replication of data or causing a copy of a specified protected server 20 environment to be connected to the network 12 and activated).

[0096] The replication software 36 is also capable of receiving a sequence of pre-programmed instructions in the form of a file-resident script that contains replication parameters and commands (such as the network address of the source files to be copied). These instructions are capable of being issued manually via a console 60 (described later) or a GUI of the virtual computing software interface 35. The replication software 36 is also capable of being programmed to monitor the protected server 20 a for user-specified events (such as the absence of a ‘heartbeat signal’) and, on detection of such an event, to execute other commands that may affect the systems and network environment. For each use of the replication software 36, there is defined a source server from which information is copied and a target server that receives the copied information.

[0097] A method of setting up the catcher 30 and carrying out local back-up and recovery of a protected server 20 using the system 10 a according to the first embodiment of the present invention is now described with reference to FIGS. 2, 7 and 8.

[0098] Referring to FIG. 2, in order to enable recovery of the protected servers 20 a,20 b to be carried out, the catcher hardware 30 is physically installed and configured at Step 201 generally according to the manufacturer's instructions (such as by running the ServerRAID™ program CD for IBM NetFinity™ servers, or SmartStart™ for Compaq Proliant™ servers). The catcher operating system 34 (such as Red Hat Linux™ or Windows 2000™) is then installed and its operating parameters (e.g. parameters relating to network cards and/or hard disk management etc) are configured according to the virtual computing software 33 requirements, such requirements being familiar to those skilled in the art. The virtual computing software 33 is then installed, and its operating parameters configured according to the manufacturer's instructions and specific solution requirements.

[0099] In Step 202, the virtual computing software management console or equivalent user interface 35 is used to create and configure the virtual machines 31 a,31 b to support the respective protected servers 20 a,20 b (one virtual machine 31 a,31 b being created for each protected server 20 a,20 b). Substantially identical copies of the operating system 22 a,22 b installed on each protected server 20 a,20 b are installed on each respective virtual machine 31 a,31 b.

[0100] In Step 203, the installed environment (i.e., the applications 24 a,24 b, utilities, and other protected server-resident agents or applications that interact with the applications 24 a,24 b) present on the first and second protected servers 20 a,20 b is substantially replicated within its respective virtual machine 31 a,31 b. The replication software 36 is then installed on each protected server 20 a,20 b and virtual machine 31 a,31 b, and replication scripts 37 are created to instruct the mirroring and replication process. The scripts are then copied to the appropriate virtual machine 31 a,31 b.

[0101] In Step 204, the replication software 36 is activated and mirrors (i.e. copies) each selected protected server file 26 a,26 b onto its respective virtual machine 31 a,31 b according to its respective replication script 37.

[0102] Step 205 is substantially contiguous with Step 204, whereby on completion of the mirroring activity, “replication” is initiated such that any subsequent and concurrent changes to the files 26 a,26 b on the protected servers 20 a,20 b are thereafter more or less continuously synchronised with their respective virtual machines 31 a,31 b during normal operating conditions. The protected servers 20 a,20 b have full accessibility to the network 12 and are thus continuously synchronising with their respective virtual machines 31 a,31 b, as illustrated in FIG. 7a. This condition persists until a failure or other disruption occurs and is detected by the replication software 36.
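
The distinction between the one-off mirroring of Step 204 and the ongoing replication of Step 205 can be illustrated with a minimal sketch. It models each file store as an in-memory dictionary, which is an assumption made purely for illustration; the function names mirror and replicate_change are hypothetical and do not correspond to any command of the replication software 36.

```python
from typing import Dict

Files = Dict[str, bytes]


def mirror(source: Files, target: Files) -> None:
    """Initial full copy (Step 204): make the target hold every selected source file."""
    target.clear()
    target.update(source)


def replicate_change(source: Files, target: Files, path: str) -> None:
    """Ongoing replication (Step 205): propagate one changed path to the target."""
    if path in source:
        target[path] = source[path]
    else:
        target.pop(path, None)  # the file was deleted on the source


protected = {"mail.db": b"v1", "config.ini": b"a=1"}
virtual_machine: Files = {}

mirror(protected, virtual_machine)

# Changes made after the mirror are forwarded as they happen.
protected["mail.db"] = b"v2"
replicate_change(protected, virtual_machine, "mail.db")
del protected["config.ini"]
replicate_change(protected, virtual_machine, "config.ini")

print(virtual_machine == protected)
```

Only the changed paths are transferred after the initial mirror, which is what allows the virtual machine to remain more or less continuously synchronised during normal operation.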

[0103] Referring now to FIG. 8, Step 801 shows that in the present conditions where there are no problems on the network 12, the first protected source server 20 a has its normal identity A, and that the target catcher 30 has its normal identity B. It is important to understand that in the presently preferred embodiments of the invention, no two entities on the same network can possess the same network identity.

[0104] Returning now to FIG. 2, in Step 206 a pre-defined failure event (such as the culmination of a pre-defined period during which a “heartbeat” signal from the first protected server's replication software 36 is found to be absent) is detected by the corresponding virtual machine's monitor program (which is part of the replication software 36) and triggers the batch command file 37. The failure is such that the first protected server 20 a is logically (and optionally physically) disconnected from the network 12 and can no longer synchronise with its corresponding virtual machine 31 a as shown in FIG. 7b.
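
The heartbeat-based detection of Step 206 might be sketched as follows. This is a minimal illustration under stated assumptions: time is passed in explicitly as seconds, the timeout value is arbitrary, and the on_failure callback merely stands in for the triggering of the batch command file 37. The class HeartbeatMonitor is hypothetical and is not part of any named replication product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HeartbeatMonitor:
    timeout_s: float                 # pre-defined period of silence that counts as failure
    on_failure: Callable[[], None]   # stands in for running the batch command file 37
    last_seen_s: float = 0.0
    failed: bool = False

    def beat(self, now_s: float) -> None:
        """Record a heartbeat received from the protected server."""
        self.last_seen_s = now_s

    def check(self, now_s: float) -> bool:
        """Trigger the failure action once if the heartbeat has been absent too long."""
        if not self.failed and now_s - self.last_seen_s > self.timeout_s:
            self.failed = True
            self.on_failure()
        return self.failed


events = []
monitor = HeartbeatMonitor(timeout_s=30.0, on_failure=lambda: events.append("failover"))
monitor.beat(now_s=10.0)
within_period = monitor.check(now_s=35.0)   # 25 s of silence: still within the period
after_period = monitor.check(now_s=41.0)    # 31 s of silence: failure is declared
monitor.check(now_s=50.0)                   # later checks do not re-trigger the action
print(within_period, after_period, events)
```

Latching the failed state ensures that the failure action runs only once per failure event, however often the monitor polls.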

[0105] In Step 207, the replication script 37 is triggered by the replication software 36. This causes the virtual machine 31 a to assume the network identity and role of the failed protected server 20 a and start the application programs 35 a installed on the virtual machine 31 a. For example, if the protected server 20 a had been a Microsoft Exchange™ server, then the copy of Microsoft Exchange™ installed on the virtual machine 31 a would be started. Any other script-driven programmable event is also initiated by the running of the script 37. This occurs without disrupting or substantially affecting any other virtual machine session on the catcher 30. Messages are then optionally transmitted to network users 14 and to the remote network management console (which is described later) to notify them of the failure event.

[0106] The running of the batch command file (or replication script) 37 causes the protected source server 20 a to have no identity, and the target virtual machine 31 a to have its normal identity B together with an assumed identity A, the former identity of the protected server 20 a (as shown in Step 802 of FIG. 8). This set of actions is called “failover” and causes the virtual machine 31 a to enter the failover condition and substantially replace the functionality of the failed protected server 20 a. In this manner, a user 14 on the network 12 is able to use the application 35 a installed on the virtual machine 31 a as if it were running from the failed protected server 20 a. The applications, files and utilities etc on the virtual machine 31 a are thus the ones interacted with as a result of user (or application, etc) interaction.

[0107] In Step 208, the failed and now logically disconnected protected server 20 a is available to be removed from the network 12 for repair or replacement. There are instances where it may be beneficial for it to be repaired in situ such as replacement of an accidentally broken network connection. Under other circumstances, such as fire damage, it may be necessary for the server 20 a to be replaced in its entirety. In these cases, physical disconnection of the protected server 20 a from the network 12 is warranted.

[0108] When the protected server 20 a has been replaced or repaired and reconnected to the network 12 (if applicable), the operating system 22 a, and applications/utilities 24 a in their current condition are copied to the protected server 20 a from the virtual machine 31 a (Step 209). The identity A of the protected server 20 a is also restored. However, the protected server 20 a remains logically disconnected from network 12 and unsynchronised with its corresponding virtual machine 31 a, as illustrated by FIG. 7c. Step 803 of FIG. 8 shows the protected source server 20 a restored, and the target catcher 30 retaining its normal identity B and the assumed identity of the protected server, A. At this stage, the virtual machine 31 a is still running in failover condition and all applications 24 a on the protected server 20 a are inactive.

[0109] In Step 210, the virtual machine 31 a is commanded manually or remotely via the virtual computing software management user interface 35 to relinquish its duplicate identity (i.e. A and B) and role to the protected server 20 a and maintain its pre-failover identity and role. This is known as the “failback” process and gives rise to the failback condition. In this state, neither the protected server 20 a nor the failback virtual machine 31 a has access to the network 12. However, this only occurs for a short period of time. Step 804 of FIG. 8 shows the target catcher 31 a retaining its normal identity B but releasing the identity of the protected server 20 a such that the protected source server identity, A, is restored. The protected server 20 a is thus logically reconnected to the network 12 (as shown in FIG. 7d).

[0110] In Steps 211 and 805, the protected server applications 24 a remain inactive, preventing all user interaction and avoiding changes to the protected server 20 a itself. The replication software 36 is then run against its replication script 37 with the protected server 20 a as its target and the virtual machine 31 a as its source. This causes replication to take place between the failover virtual machine 31 a and the protected server 20 a, thereby copying files 26 a from the virtual machine 31 a to the protected server 20 a, culminating in full synchronisation between the two.

[0111] In Step 212, the protected server applications 24 a are activated and made available to users 14, restoring the protected server 20 a to normal condition (see Step 205), such that the protected server 20 a has full accessibility to the network 12 and is continuously synchronising with its corresponding virtual machine 31 a (see FIG. 7e). Step 806 of FIG. 8 shows that the protected source server 20 a has its normal identity A, and that the target catcher 31 a has its normal identity B.
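
The movement of network identities through failover and failback (Steps 801 to 806 of FIG. 8) can be summarised in a short sketch. It tracks only the identities presented on the network, so the intermediate condition of Step 803, in which the repaired server is still logically disconnected, is not modelled separately. The Network class and the host labels are hypothetical names introduced for illustration; the rule that no two hosts may hold the same identity is taken from the description above.

```python
from typing import Dict, Set


class Network:
    """Tracks which host currently presents each identity on the network."""

    def __init__(self) -> None:
        self.identities: Dict[str, Set[str]] = {}

    def claim(self, host: str, identity: str) -> None:
        # Enforce that no two entities on the same network share an identity.
        for other, held in self.identities.items():
            if other != host and identity in held:
                raise ValueError(f"identity {identity} already held by {other}")
        self.identities.setdefault(host, set()).add(identity)

    def release(self, host: str, identity: str) -> None:
        self.identities.get(host, set()).discard(identity)


net = Network()
net.claim("protected_server", "A")    # Step 801: normal condition
net.claim("catcher_vm", "B")

net.release("protected_server", "A")  # Step 802: failover
net.claim("catcher_vm", "A")
failover_view = {h: sorted(i) for h, i in net.identities.items()}

net.release("catcher_vm", "A")        # Step 804: failback
net.claim("protected_server", "A")
normal_view = {h: sorted(i) for h, i in net.identities.items()}

print(failover_view)
print(normal_view)
```

Releasing identity A before it is claimed elsewhere corresponds to the short period during failback in which neither machine is reachable under that identity.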

[0112] Returning now to FIG. 3a, there are shown first and second environments 10 b,10 c which, when combined, are suitable for implementing a second embodiment (the “extended remote” method) of the present invention which is suitable for use when failure of one or more protected servers 20 occurs. The first and second environments 10 b,10 c are now described.

[0113] Environment 10 b comprises three protected server computers 20 a,20 b,20 c in communication with a protected catcher 30 via a local network 12. This arrangement is referred to hereinafter as the “protected environment”. The protected servers 20 a,20 b,20 c and the protected catcher 30 are all configured in the manner of the first embodiment of the invention.

[0114] The second environment 10 c comprises a remote local network 16 to which are attached three recovery servers 40, a recovery catcher 50, a console 60 (such as a customised PC) and a database 62 (such as SQL, Access or OMNIS). This configuration 10 c is hereinafter known as the “recovery environment”. The protected environment 10 b and the recovery environment 10 c are linked by connecting the local network 12 to the remotely located local network 16 via a telecommunications connection 11.

[0115] The recovery catcher 50 (shown in FIG. 3e) has substantially the same configuration as the protected catcher 30, i.e. it has an operating system 54, and virtual computing software 53 for implementing multiple virtual machines 51 a,51 b. Each protected catcher virtual machine 31 a,31 b etc is represented in the present embodiment by an identical virtual machine 51 a,51 b etc on the recovery catcher 50. Such an identical virtual machine 51 a (shown in FIG. 3c) comprises applications 55 a, replication software 36 and a customisable batch file 37, supported by a virtual machine operating system 58.

[0116] Each recovery server 40 is an Intel-based platform with pre-configured hardware and a version of DOS installed on its hard disk drive that is capable of being booted by the application of power to the server 40 causing its boot file (such as autoexec.bat and/or config.sys) to be run.

[0117] The console hardware 60 (such as an Intel-compatible PC running a proprietary operating system such as Microsoft Windows NT™) is used to host console software 65. The console hardware 60 includes a mains power switching unit 66 and an event monitoring unit 67. The console hardware 60 is attached to the local network 16 through a hardware interface and associated software. Consequently, the console 60 is capable of detecting a protected server failure and responding automatically by selecting and “powering on” an appropriate recovery server 40.

[0118] The console software 65 comprises such algorithms, procedures, programs and files as are required to ensure that the optimum server recovery response is delivered in the event of any combination of failure events as may be notified to the console software 65. The console software 65 is additionally required to store such images (i.e., snapshots of data which correspond to the operating system and files) as are required to permit the protected servers 20 to be replicated on potentially dissimilar recovery servers 40. Three such images 68 are shown in FIG. 3a.

[0119] The console software 65 also includes a database 62 (such as MS Access™, or MSDE™) with a record structure as shown in FIG. 3d. This record comprises the processor type, the processor manufacturer, the speed of the processor, the size of the random access memory, and the hard disk capacity, that is, such information as is required to programmably identify an optimum recovery server 40 from those available on the remotely located local network 16. The selection criteria for the optimum recovery server 40 may vary according to circumstances, but include server ability to satisfy the minimum operating requirement of the image 68 that is to be loaded. Other rules may be imposed by system managers or users 14 of the invention to minimise the exposure to risk in the event that multiple servers 20 fail concurrently, for example by always selecting the lowest specification server 40 available that satisfies the minimum requirement.
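
A selection rule of the kind just described, namely choosing the lowest specification recovery server that still satisfies the minimum requirement of the image, might be sketched as follows. The record fields follow those listed for FIG. 3d, but the units, the name field, the ImageRequirement type and the ordering used to rank "specification" (memory, then disk, then processor speed) are all assumptions made for the purpose of illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ServerRecord:
    name: str                    # hypothetical key, not one of the listed record fields
    processor_type: str
    processor_manufacturer: str
    processor_speed_mhz: int     # units are an assumption
    ram_mb: int
    disk_gb: int


@dataclass(frozen=True)
class ImageRequirement:
    min_speed_mhz: int
    min_ram_mb: int
    min_disk_gb: int


def select_recovery_server(
    available: Sequence[ServerRecord], req: ImageRequirement
) -> Optional[ServerRecord]:
    """Pick the lowest-specification server that still satisfies the image's minimum."""
    eligible = [
        s for s in available
        if s.processor_speed_mhz >= req.min_speed_mhz
        and s.ram_mb >= req.min_ram_mb
        and s.disk_gb >= req.min_disk_gb
    ]
    if not eligible:
        return None
    # Assumed ordering of "specification": memory first, then disk, then clock speed.
    return min(eligible, key=lambda s: (s.ram_mb, s.disk_gb, s.processor_speed_mhz))


available = [
    ServerRecord("rs-1", "PII", "Intel", 400, 256, 20),
    ServerRecord("rs-2", "PIII", "Intel", 800, 512, 40),
    ServerRecord("rs-3", "PIII", "Intel", 1000, 2048, 120),
]
requirement = ImageRequirement(min_speed_mhz=500, min_ram_mb=512, min_disk_gb=30)
best = select_recovery_server(available, requirement)
print(best.name if best else "no suitable recovery server")
```

Choosing the smallest adequate machine leaves the more capable recovery servers free for any further protected server that fails concurrently, which is the risk the rule is intended to limit.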

[0120] A method of backing up and recovering critical data according to the second embodiment of the invention is now described with reference to FIG. 2 and FIGS. 4a and 4b.

[0121] In Step 401, the protected environment 10 b is created by following previously described Steps 201 through 205 inclusive. That is, firstly the protected catcher 30 is configured and prepared. Secondly, the virtual machines 31 are created and configured on the protected catcher 30. Then the protected servers 20 a,20 b are duplicated on their respective virtual machines 31 a,31 b, followed by the mirroring of files from the protected servers 20 a,20 b to the virtual machines 31 a,31 b. Next, the replication of ongoing changes to the files/data residing on the protected servers 20 a,20 b is carried out, if required.

[0122] In Step 402, the recovery catcher 50 is built by repeating Steps 201 and 202, and thereby creating such virtual machines 51 a etc as are necessary to support each of the protected servers 20 a,b etc which may require remote recovery capability.

[0123] Each virtual machine 51 a etc residing on the recovery catcher 50 is uniquely identified to the network 16, and as mentioned previously is equipped with an operating system 58, an application program 55 a and replication software 36 configuration that is identical to its counterpart protected virtual machine 31 a etc. This may be achieved by taking a conventional backup of the protected virtual machine 31 a and restoring this backup to the corresponding recovery virtual machine 51 a.

[0124] In Step 403, the recovery servers 40 are each physically, but not logically, connected to the network 16 by appropriate means. Each of the recovery servers 40 is pre-configured with only a network boot image (i.e. a bootable operating system that will bring up a server that is network enabled) resident on its hard disk drive, such as is described in Step 412 below.

[0125] In Step 404, the console 60 and associated database 62 are prepared by specifying in advance which recovery server 40 is capable of taking over the function and role of each of the protected servers 20, and specifying any preferred recovery servers. This step also includes physically and logically connecting the console 60 to the network 16 by appropriate means.

[0126] In Step 405, the network connection 11 between the protected environment 10 b and the recovery environment 10 c is established.

[0127] In Step 406, the replication software 36 installed on each recovery catcher virtual machine 51 a etc and on its corresponding protected catcher virtual machine 31 a etc is configured and activated to synchronise the recovery catcher virtual machines 51 a etc with the protected catcher virtual machines 31 a etc, such that file changes taking place on a given protected server 20 are first replicated to the designated protected catcher virtual machine 31 a etc, and where so required are subsequently again replicated to the corresponding recovery catcher virtual machine 51 a etc.
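
The two-hop replication established in Step 406 can be pictured with a minimal sketch in which each replication target may forward a change to a further target. The Replica class, its downstream attribute and the labels used are hypothetical, and the in-memory dictionaries stand in for the file stores of the virtual machines.

```python
from typing import Dict, Optional


class Replica:
    """A replication target that may forward each change to a further target."""

    def __init__(self, name: str, downstream: Optional["Replica"] = None) -> None:
        self.name = name
        self.files: Dict[str, bytes] = {}
        self.downstream = downstream

    def apply(self, path: str, data: bytes) -> None:
        self.files[path] = data
        if self.downstream is not None:
            self.downstream.apply(path, data)  # second hop, "where so required"


recovery_vm = Replica("recovery catcher VM 51a")
protected_vm = Replica("protected catcher VM 31a", downstream=recovery_vm)

# A file change on a protected server reaches the protected catcher first
# and is then forwarded to the recovery catcher.
protected_vm.apply("mail.db", b"v3")
print(recovery_vm.files == protected_vm.files)
```

Omitting the downstream target for a given virtual machine corresponds to a protected server that does not require remote recovery capability.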

[0128] In Step 410 a protected server 20 fails. As a result of this failure, a failover condition is entered at Step 411 such that the designated virtual machine 31 a etc on the protected catcher 30 substantially immediately adopts the failed protected server's identity and role (as in Step 207 of the previously described method).

[0129] In Step 412, the protected catcher 30 transmits an indication to the console 60 in order to uniquely identify the failed protected server(s) 20. This failure is indicated to the console 60 by sending a file containing the identity of the failed protected server(s) 20 via the network connection 11. The console 60 is programmed such that on receipt of this file (or when it detects protected environment 10 b failure such as the culmination of a pre-defined period during which a “heartbeat” signal from protected catcher 30's replication software 36 is found to be absent), it submits the failed protected server 20 identity to a pre-installed database application 63 and obtains from the database application the identity of the previously selected best fit recovery server 40 and a custom-written DOS batch command file 70. The console 60 then enables the supply of power to the selected recovery server 40 causing the recovery server to boot using the pre-installed network boot image, and then activating the recovery server 40 that will be used as a physical replacement for the failed protected server. The individual steps taken to implement the automatic rebuilding of the replacement replica protected server 20 are now described with reference to FIG. 4b.

[0130] In Step 4120, power is applied by the console 60 to the recovery server 40 causing it to boot and run its boot file as is normal for Intel-based server platforms. The boot file issues a command to the console 60 via the network 16 instructing the console 60 to construct and execute a batch file 70 on its hard disk drive. This batch file 70 is built to contain instructions that are specific to the identity of the failed protected server and the selected recovery server. The console batch file 70 causes a replication software script 37 to be created that is specific to the identity of the failed protected server 20, the selected recovery server 40 and the corresponding virtual machine 51 a etc on the recovery catcher 50.

[0131] In Step 4121, the console batch file 70 runs and uses the already available identity of the failed protected server 20 to locate the database-stored pre-prepared image 68 that must be loaded onto the recovery server 40. The console batch file 70 commands this image 68 to be transmitted across the network 16 and loaded onto the recovery server 40 hard disk drive. The image 68 contains a special disk partition (called the “back door”) that enables the failed protected server operating system 22 to be loaded as an image, and then activated without the normally required level of manual intervention, such techniques being familiar to those skilled in the art. The image also contains a copy of the replication software 36.

[0132] In Step 4122, the console batch file 70 accesses the recovery server 40 via the back door copy of the operating system and starts the recovery server, thereby creating a fully operational and networked server environment with no applications installed or running. The console batch file 70 then starts a monitor program that polls the network 16 searching for a replication software session against which to run the specifically prepared replication script 37.

[0133] In Step 4123, the replication software 36 is automatically started from within the recovery server operating system, sending a ready signal to the network 16.

[0134] In Step 4124, a console monitor program detects the recovery server 40 replication software 36 ready signal and starts the replication software on the recovery catcher 50 against the specifically prepared replication script 37, thereby establishing a replication activity between the recovery server 40 and the corresponding recovery catcher virtual machine 51.

[0135] In Step 4125, the replication software 36 on the recovery catcher virtual machine 51 completes the specifically prepared replication script 37 causing the applications and other files resident in the recovery catcher virtual machine 51 a etc to be synchronised with the recovery server 40.

[0136] Returning now to FIG. 4a, in Step 413 the mirroring of the failed protected server 20 and any subsequent changes captured by the protected catcher 30, recovery catcher 50 and recovery server 40 is completed, and ongoing replication between the recovery catcher virtual machine 51 and the recovery server 40 is taking place. The console 60 terminates replication to the recovery server 40 automatically, on completion of replication, or manually, and the recovery server 40 is then optionally physically detached from the network 16 and physically attached to the local (protected) network 12 ensuring that there is no conflict of identity with other servers 20 and virtual machines 31 a etc on the network as in Steps 207 and 208.

[0137] In Step 414, the replication script 37 used to replicate data from the protected servers 20 to the protected catcher 30 may be used to synchronise the protected catcher virtual machines 31 a etc with the recovery servers 40 (now protected servers 20) that have been attached to the protected network 12. However, a replication script is not absolutely necessary, and the replication process can be initialised manually.

[0138] In Step 415, previous Steps 209 through 212 inclusive are repeated for each recovery server 40 (now protected server 20) and its corresponding protected catcher virtual machine 31 a etc, thereby synchronising and restoring protected server 20 functionality to the protected local network 12.

[0139] On completion of the above described process, the local protected network 12 is substantially restored to its normal operating condition, permitting additional recovery servers 40 to be profiled and attached to the recovery network 16. Should a further failure occur on the protected network 12, the additional recovery servers 40 are then in place and ready to provide back-up and recovery of the protected servers 20.

[0140] The aforedescribed network combination 10 b,10 c can also be usedfor off-line management activities according to an alternative secondembodiment of the present invention. More particularly, the recoverynetwork 10 b can be used as a means of providing a controlled secureenvironment for the completion of protected system management activitiessuch as testing and installation of upgrades. Such management activitiesare now described with reference to Steps 420 to 423 of FIG. 4a.

[0141] In order to initialise a suitable environment, Steps 401 through406 are completed as previously described.

[0142] In Step 420, an appropriate time to undertake network management activities (e.g., lunchtime or the weekend) is identified, and the protected server 20 to be included in the management activity is determined.

[0143] In Step 421, the console 60 is used to trigger construction of areplica recovery server 40 in respect of the identified protected server20 using previously described Step 412.

[0144] However, in this method, failover is not performed by theprotected catcher 30 and the protected server 20 continues to operatenormally.

[0145] In Step 422, the console 60 is used to terminate replication from the recovery catcher virtual machine 51 a etc to the recovery server 40, and this server is then available to be physically detached from the recovery network 16, whereupon it is physically attached to a test network (not shown), allowing use for system management or other purposes.

[0146] In Step 423, once testing etc has been completed, all operationaldata is cleared down from the detached recovery server 40. Step 403 isnow repeated, restoring the recovery network 16 to its ready condition.

[0147] It is to be appreciated that for testing purposes described aboveand even replacement of failed servers on the local network 12, it isnot necessary in this embodiment for the local network to employ aprotected catcher 30. Rather, all of its functions can be carried out bythe remote catcher 50.

[0148] The previous method described the use of the recovery environment10 c to provide a back-up and recovery system to deal with the failureof protected servers 20. However, the recovery environment 10 c may alsobe used to protect against the failure of the entire protectedenvironment 10 b. The flow diagram in FIG. 4a shows further method Steps430 to 433 in which a way of responding to failure of the entireprotected environment 10 b is described according to a third embodimentof the invention. This circumstance may arise due to widespreadequipment failure, destruction of buildings or loss oftelecommunications or other causes. If such an event occurs, it may bedesirable to reconstruct automatically the protected environment 10 busing the information and equipment in the recovery environment 10 c.

[0149] Firstly, the protected environment 10 b and the recovery environment 10 c are initialised in the manner previously described in Steps 401 to 406. In Step 430, the protected environment 10 b fails, causing loss of replication between the protected catcher 30 and recovery catcher 50. In Step 431, the console 60 detects that the protected environment 10 b has failed (for example, by the expiry of a pre-defined period during which a “heartbeat” signal from the protected catcher's replication software 36 is found to be absent) and responds by identifying and automatically powering on the optimum set of recovery servers 40 that will be used as replacements for the failed protected servers 20, and installing the required software images 68 onto them.
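By way of illustration only, the heartbeat-based detection of Step 431 might be sketched as follows. The class name HeartbeatMonitor, the method names and the injectable clock are assumptions made for this sketch and do not form part of the described system; the length of the pre-defined period is left to the caller.

```python
import time


class HeartbeatMonitor:
    """Reports a failure once no heartbeat has arrived for a pre-defined period.

    Hypothetical helper: names and structure are assumptions for illustration.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds  # the pre-defined period
        self.clock = clock                      # injectable so it can be tested
        self.last_seen = clock()                # treat start-up as a heartbeat

    def record_heartbeat(self):
        """Call whenever a heartbeat signal is received from the catcher."""
        self.last_seen = self.clock()

    def has_failed(self):
        """True when the heartbeat has been absent for longer than the period."""
        return self.clock() - self.last_seen > self.timeout_seconds
```

In such a sketch the console would poll has_failed() periodically and, on a True result, proceed to select and power on the recovery servers.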

[0150] In this embodiment of the invention, the console 60 submits allfailed protected server 20 identities to a pre-installed database 62and, according to a pre-programmed priority sequence, obtains theidentities of the most appropriate recovery servers 40 and correspondingDOS batch files 70. The console 60 enables the supply of power to therecovery servers 40 causing them to boot using the pre-installed networkboot images. The console 60 then triggers the automatic rebuilding ofreplacement replica servers using the mirroring process (Step 412).
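The database lookup described above can be illustrated by a minimal sketch. The record layout (RecoveryPlan), the convention that a lower priority number is restored first, and the function name assign_recovery_servers are all assumptions introduced for illustration; the specification does not define the schema of the database 62.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPlan:
    """Hypothetical database record for one protected server."""
    protected_id: str
    priority: int       # assumed convention: lower number is restored first
    candidates: tuple   # recovery server identities, most appropriate first
    batch_file: str     # name of the corresponding DOS batch file


def assign_recovery_servers(failed_ids, database, available):
    """Pair each failed protected server with a free recovery server.

    Failed servers are processed in priority order; each takes the first of
    its candidate recovery servers that is still unassigned.
    """
    free = set(available)
    assignments = []
    plans = sorted((database[i] for i in failed_ids), key=lambda p: p.priority)
    for plan in plans:
        chosen = next((c for c in plan.candidates if c in free), None)
        if chosen is None:
            continue  # no suitable spare remains; left for manual handling
        free.remove(chosen)
        assignments.append((plan.protected_id, chosen, plan.batch_file))
    return assignments
```

Each returned tuple names the failed server, the recovery server to be powered on, and the batch file to be run on it.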

[0151] In Step 432, once the mirroring of the recovery catcher 50 andrecovery server 40 is completed, the recovery servers 40 are thencapable of either being physically detached from the network 16 or,should the need arise, can be used in situ for emergency operationalactivities.

[0152] In Step 433, the protected and recovery environments 10 b,10 care rebuilt in response to specific circumstances. This could includethe rebuilding of the protected environment if the replica recoveryservers 40 are being used in situ, or the rebuilding of the recoveryenvironment if the replica recovery servers have been physicallydetached from the recovery environment. Ultimately, Steps 401 through406 are repeated to re-establish continuous operational capability.

[0153] Where a protected catcher virtual machine 31 a,31 b is notrepresented directly on the recovery catcher 50 it will not be possibleto provide extended recovery of the corresponding protected server 20a,20 b. However, it will be possible to provide simplified recovery,such choice being available to the user by appropriate configuration ofthe virtual computing software 53 running on the recovery catcher 50.

[0154] There may be circumstances where remote location of the recoverynetwork 16 is undesirable or infeasible, or where economics orpracticalities dictate the use of a single catcher server 30 in whichcase there is provided a further method (the extended local method) ofbacking-up and recovering critical data according to a fourth embodimentof the present invention which is now described.

[0155] Referring to FIG. 3b, there is shown an environment 10 d suitablefor implementing the extended local method. The network 12 includesthree protected servers 20, a catcher computer 30, three recoveryservers 40, and a console 60. In this embodiment of the presentinvention, the functions performed by the protected catcher 30 and therecovery catcher 50 of the aforedescribed second and third embodimentsof the invention are combined in the local catcher 30.

[0156] The local catcher 30 is configured substantially identically tothe protected catcher 30 of the second embodiment. The local catcher 30may be required to replicate concurrently with protected servers 20 andrecovery servers 40, and the replication software 36 is configured toallow this to take place.

[0157] The extended local method of the invention is substantially identical to the aforementioned extended remote method, except that it excludes the recovery catcher 50 and the remote recovery network 16, as can be seen from FIG. 3b. Because the recovery servers 40 and local catcher 30 are part of the same local network 12, they may be concurrently affected by local environment failures which affect either the local catcher 30 or the local area network 12. The individual steps of the extended local method of the present embodiment of the invention are now described with reference to the flow charts shown in FIGS. 4a and 4 b.

[0158] In Step 401, the protected environment 10 b is created byfollowing previously identified steps 201 through 205 inclusive.

[0159] Step 402 is omitted in this embodiment, there being no remotecatcher 50.

[0160] In Step 403, the recovery servers 40 are each connected to thelocal network 12 by appropriate means with only a network boot imagesuch that when power is applied to the recovery server it boots directlyinto the console 60. As previously described, the console 60 includes amains power switching unit 66, and an event monitoring unit 67.

[0161] In Step 404, the console 60 is prepared and connected to thenetwork 12 by appropriate means and its database 62 populated withimages 68 appropriate to the protected environment 10 b.

[0162] Step 405 is omitted in this embodiment of the invention, therebeing no external networking requirement.

[0163] Step 406 is also omitted in this embodiment of the invention, there being no requirement to connect or synchronise the protected catcher 30 and recovery catcher 50, these being replaced by the local catcher 30.

[0164] In Step 410, a protected server 20 fails.

[0165] In Step 411, a failover condition is entered such that thedesignated virtual machine 31 on the local catcher 30 adopts the failedprotected server's identity and role.

[0166] In Step 412, the local catcher 30 transmits an indication to the console 60 uniquely identifying the failed protected server 20. The console 60 is programmed so that when such an indication is received, or when it detects protected environment failure (such as the expiry of a pre-defined period during which a “heartbeat” signal from the local catcher's replication software 36 is found to be absent), it responds by identifying and automatically powering on the optimum recovery server 40 that will be used as a replacement for the failed protected server 20, and installing the required software image 68.

[0167] Failure indication is achieved by the console 60 receiving a file containing the identity of the failed protected server 20 via the network connection 12 using File Transfer Protocol (FTP). On receipt of this file, the console 60 submits the failed protected server identity to the pre-installed database 62, and obtains from the database the identity of the most appropriate recovery server 40 and a corresponding DOS batch file 70. The console 60 enables the supply of power to the most appropriate recovery server 40, causing it to boot using the pre-installed network boot image (not shown).

[0168] The console 60 activates the recovery server(s) 40 according tothe steps set out in FIG. 4b. As the process by which this takes placeis identical to that carried out for the remote recovery network (whichhas already been explained), the details of the process are not includedhere.

[0169] Returning now to FIG. 4a, at Step 413 the mirroring of the failed protected server 20 and any subsequent changes captured by the local catcher 30 and the recovery server 40 is completed, and ongoing replication of the failed protected server 20 takes place. The recovery server 40 is then optionally left in situ (as it is already accessible to users on the network 12), or it is physically detached from and then re-attached to the network 12 using the methods previously described in Steps 207 to 208.

[0170] In Step 414, the replication script 37 used to replicate datafrom the protected servers 20 to the local catcher 30 may be used tosynchronise the local catcher virtual machines 31 a etc with therecovery servers 40 (now protected servers 20) that have been attachedto the protected network. However, a replication script is notabsolutely necessary, and the replication process can be initialisedmanually.

[0171] In Step 415, previous Steps 209 through 212 inclusive arerepeated for each recovery server 40 and its corresponding local catchervirtual machine 31 a etc, thereby synchronising and restoring protectedserver 20 functionality to the network 12.

[0172] On completion of the above described process, the protectednetwork 12 is substantially restored to its normal operating conditionpermitting additional recovery servers 40 to be profiled and attached tothe recovery network in readiness for further failures, should theyoccur.

[0173] The fifth, and final, embodiment of the invention concerns theuse of a catcher computer 30 for recording multiple snapshots of theprotected servers 20. Referring now to FIG. 5, there is shown anenvironment 10 e suitable for implementing the fifth embodiment of theinvention which is known as the “applied” method.

[0174] The environment 10 e comprises a local network 12, a protected server 20 and a catcher 30, these being configured in the manner already described herein. The protected server 20 includes applications 24, files 26, replication software 36 and an operating system 22, which together are called the “applied environment”. However, in the present embodiment of the invention, the catcher 30 supports five virtual machines 31 a to 31 e, each hosting a replica of the same protected server operating environment and each capable of running replication software 36. The catcher 30 also supports a programmable scheduler application that is capable of issuing commands to other programs at certain times and/or after certain delays and/or on the occurrence of certain predetermined events.

[0175] In the present embodiment of the invention, virtual machine 31 ais designated the “target” virtual machine, and virtual machines 31 b to31 e are designated “rollback” virtual machines which store different“snapshots” of the state of the protected server 20 over time as isexplained below.

[0176] Referring now to FIG. 6, in Step 601 the applied environment 10 e is created by following previously described Steps 201 through 203 inclusive, thereby creating and installing the target virtual machine 31 a so that it contains a near-replica of the protected server's operating system 22 and files 26, including applications 24, data and mirroring software 36.

[0177] In Step 602, the target virtual machine 31 a is copied to create additional identical rollback virtual machines 31 b to 31 e on the catcher 30. Replication is then initiated from the protected server 20 to the target virtual machine 31 a as previously described in Steps 204 and 205.

[0178] In Step 603, the scripts 37 that direct the catcher replicationsoftware 36 in each of the rollback virtual machines 31 b through 31 eare programmed such that they mirror data from the catcher targetvirtual machine 31 a to each respective rollback virtual machine 31 b to31 e when so commanded in the manner previously described herein.

[0179] In Step 604, the scheduler program is programmed to schedule therunning of the respective rollback virtual machine replication scripts37 to run at such times and/or after such intervals as may be required.

[0180] In Step 605, at the scheduled time t1 the replication software 36in rollback virtual machine 31 b is triggered, and virtual machine 31 ais mirrored to virtual machine 31 b.

[0181] In Step 606, virtual machine 31 a is mirrored to virtual machine31 c at time t2. In Step 607, virtual machine 31 a is mirrored tovirtual machine 31 d at time t3. In Step 608, virtual machine 31 a ismirrored to virtual machine 31 e at time t4.

[0182] On completion of Steps 605 through 608, the cycle is resumed (starting at Step 605) until such time as the cycle is interrupted for whatever reason. This sequence of events allows the production of a series of ‘snapshot’ copies of the target virtual machine 31 a and hence of the protected server 20. As the snapshots are taken at different points in time, the most appropriate snapshot can be selected under varying circumstances. For example, if data corruption is detected but actually occurred six hours earlier, then the last snapshot taken prior to the corruption might be used to retrieve data which might otherwise be lost.
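The repeating cycle of Steps 605 through 608 can be illustrated by the following sketch, in which each scheduled time overwrites the next rollback virtual machine in round-robin order. The function name run_snapshot_cycle, the callable target_state_at standing in for the mirroring operation, and the string labels for the virtual machines are assumptions made purely for illustration.

```python
from itertools import cycle


def run_snapshot_cycle(target_state_at, rollback_vms, times):
    """Mirror the target virtual machine to each rollback machine in turn.

    target_state_at(t) is a stand-in for mirroring the target at time t.
    Returns a mapping of rollback machine to (snapshot time, snapshot state).
    """
    snapshots = {}
    order = cycle(rollback_vms)  # after the last machine, resume at the first
    for t in times:
        vm = next(order)
        snapshots[vm] = (t, target_state_at(t))  # overwrite the oldest copy
    return snapshots
```

With four rollback virtual machines, the fifth scheduled time in such a scheme overwrites the snapshot taken at the first, so that the catcher always holds the four most recent snapshots.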

[0183] In Step 610, during and concurrent with Steps 605 to 608 inclusive, the replication software 36 running on the target virtual machine 31 a is in normal condition and is continuously monitoring for pre-defined corruption events (such as notification of a virus or other data corruption or threat) impinging on the protected server 20.

[0184] In Step 611, the target virtual machine 31 a detects a corruptionevent.

[0185] In Step 612, the target virtual machine 31 a terminatesreplication to all catcher rollback virtual machines 31 b to 31 einclusive, and runs a batch file (not shown) containing commandsprogrammed to reduce the risk of further corruption (such as suspendingall network sessions).

[0186] In Step 613, the cause of the disruption is either manually orautomatically analysed, and the appropriate corrective measures are thenapplied to all affected network components and virtual machines 31 (suchas running an anti-virus product, causing all network components to bedisinfected).

[0187] In Step 614, it is decided whether to continue using theprotected server 20, or whether to fall back to one of the rollbackvirtual machines 31 b to 31 e. Such a decision may be taken based on theresults of the aforementioned analysis, and may take into account theextent of protected server 20 corruption, and the age of the informationheld on the rollback virtual machine.

[0188] At Step 615, if direct corrective measures prove ineffective and the decision is taken to use a rollback virtual machine, the most recent viable rollback virtual machine is identified. The protected server 20 is then detached from the network 12 and rebuilt as described in Steps 208 through 212. The chosen viable virtual machine is then either manually or automatically failed back onto the physical protected server 510.
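One possible way of identifying the most recent viable rollback virtual machine is sketched below. The sketch assumes, purely for illustration, that a snapshot is viable if it was taken before the time at which the corruption is estimated to have occurred; the function name select_rollback and the mapping of virtual machine labels to snapshot times are likewise assumptions. In practice viability might instead be established by analysing each virtual machine for corruption.

```python
def select_rollback(snapshot_times, corruption_time):
    """Return the rollback machine holding the latest pre-corruption snapshot.

    snapshot_times maps each rollback virtual machine to the time at which
    it was last mirrored. Returns None when no snapshot predates the
    corruption, in which case no rollback machine is viable.
    """
    viable = {vm: t for vm, t in snapshot_times.items() if t < corruption_time}
    if not viable:
        return None
    return max(viable, key=viable.get)  # the most recent of the viable copies
```

For example, with snapshots taken at hours 6, 8, 10 and 12 and corruption estimated at hour 9, the machine holding the hour 8 snapshot would be chosen.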

[0189] It can be seen that by using the applied method described herein,it is possible to maintain a more or less continuous fallback positionfor any protected server against a variety of different types ofcorruption that might be passed on to the virtual machine when using thefirst embodiment of the invention. In addition to this, the appliedmethod may also be used in combination with the other embodiments of theinvention disclosed herein.

[0190] Having described particular preferred embodiments of the present invention, it is to be appreciated that the embodiments in question are exemplary only and that variations and modifications such as will occur to those possessed of the appropriate knowledge and skills may be made without departure from the spirit and scope of the invention as set forth in the appended claims. For example, each of the separate network environments described herein is shown to include only a single catcher computer. However, multiple catcher computers could be used in both the protected and recovery environments, for example, in order to provide a back-up and recovery function for a very large network. Furthermore, simultaneous failure of two or more protected servers can be handled by relying on the catcher utilising two or more respective virtual machines to switch in recovery servers.

1. A recovery system for sustaining the operation of a plurality ofnetworked computers in the event of a fault condition, the systemcomprising: a plurality of virtual machines installed on a recoverycomputer, each virtual machine being arranged to emulate a correspondingnetworked computer, and the recovery computer being arranged, in theevent of a detected failure of one of the networked computers, toactivate and use the virtual machine which corresponds to the failednetworked computer.
 2. A recovery system according to claim 1, whereinthe recovery computer is arranged to connect a virtual machine,corresponding to the failed network computer, to a communicationsnetwork of the networked computers in the event of failure of thenetworked computer.
 3. A recovery system according to claim 2, whereinthe recovery computer is arranged to assign a pre-stored unique networkidentity of the failed network computer to the virtual machine in theevent of failure of the networked computer.
 4. A recovery system according to any preceding claim, further comprising monitoring means arranged to monitor the operational condition of each of the networked computers and to generate an alert signal in response to detecting a fault condition relating to one or more of the plurality of computers.
 5. A recovery system according to any preceding claim, wherein the recovery computer is responsive to the alert signal to activate and use the virtual machine corresponding to the failed network computer.
 6. Arecovery system according to any preceding claim, further comprisingupdating means for updating the recovery server with collected datarelating to the specific operation of each of the network computers, theupdating means being arranged to store the collected data in theappropriate files relating to each of the virtual machines.
 7. Arecovery system according to claim 6, wherein the updating meanscomprises a component provided on a network computer and anothercomponent provided at the recovery server, the components operatingtogether to synchronise the updating of the respective virtual machineat the recovery computer.
 8. A recovery system according to anypreceding claim, wherein each virtual machine includes an updatedversion of the data environment present on its corresponding networkcomputer.
 9. A recovery system according to claim 8, wherein the data environment of each virtual machine includes its own operating system corresponding to that of its networked computer, such that a plurality of computing environments are hosted on a single recovery computer.
 10. A recovery system according to any preceding claim, wherein the recovery computer is arranged to host entirely different computing environments on the recovery computer by isolating each of the plurality of virtual machines from each other.
 11. A recovery system according to any preceding claim, wherein the networked computers are connected together by a local communications network and the recovery computer is located remote therefrom and is arranged to communicate with the network computers over a wide-area communications network.
 12. A recovery system according to any of claims 1 to 10, wherein the networked computers and the recovery computer are connected together by a local communications network and the system further comprises a further recovery computer located remote therefrom and arranged to communicate with the network computers and recovery computer via a wide-area communications network.
 13. A recovery system according to claim 12, wherein the recovery computer is arranged to transmit a repetitive condition status signal to the further recovery computer, and the further recovery computer is arranged to signal a fault condition if there is an interruption in the receipt of that condition status signal.
 14. A recovery system accordingto any of claims 11 to 13, further comprising a plurality of back-upcomputers arranged to back-up the networked computers and a controllercontrolling the configuring of the back-up computers in response to afault condition caused by one of the networked computers.
 15. A recoverysystem according to claim 14, further comprising a store operativelycoupled to the controller, the controller being arranged to storereconfiguration images of each of the networked computers in the storefor use in the event of a fault condition.
 16. A recovery systemaccording to claim 15, wherein the controller comprises means forswitching on the backup computers, and for loading reconfigurationimages of the networked computers into a plurality of the back-upcomputers.
 17. A recovery system for rapidly creating a replacementnetwork computer arranged, in the event of a fault condition, to restorethe operation of a plurality of network computers connected via a firstcommunications network, the system comprising: a plurality of virtualmachines installed on a recovery computer, each virtual machine beingarranged to emulate a corresponding network computer, a plurality ofback-up computers, a console arranged to store composition images of theplurality of network computers; and a remotely located secondcommunications network connecting together the recovery computer,back-up computers and console, wherein the console is arranged, in theevent of failure of one of the networked computers, to configure aselected back-up computer with the composition image of the failednetwork computer for physical replacement of the failed networkcomputer.
 18. A recovery system according to claim 17, wherein theselected back-up computer is removable from the second communicationsnetwork and is arranged to be attachable to the first communicationsnetwork for physical replacement of the failed network computer.
 19. Arecovery system according to claim 17 or 18, wherein the selectedback-up computer is configured to have the same network identity as thefailed network computer.
 20. A recovery system according to any ofclaims 17 to 19, wherein the console is arranged to update regularlycomposition images of the network computers with operation informationreceived from the first network regarding the current operation of thenetwork computers.
 21. A recovery system according to any of claims 17to 20, wherein the console is arranged to power up the selected back-upserver by electronically switching power to the selected back-up serverin the event of a sensed failure condition.
 22. A recovery systemaccording to any of claims 17 to 21, further comprising means formonitoring the condition of the plurality of network computers and forgenerating an alert when a fault condition is sensed.
 23. A recoverysystem according to claim 22, wherein the monitoring means is arrangedto sense a regular signal received from the first network indicating anon-fault condition and to sense the fault condition when there is abreak in the reception of the regular signal.
 24. A method of sustainingthe operation of a plurality of networked computers connected via acommunications network, in the event of a fault condition, the methodcomprising: emulating the plurality of networked computers on acorresponding plurality of virtual machines installed on a recoverycomputer; detecting a failure of a networked computer, attaching thevirtual machine corresponding to the failed network computer to thenetwork; and activating and using the virtual machine which correspondsto the failed networked computer.
 25. A method according to claim 24,wherein the emulating step comprises a once-only mirroring of a dataenvironment from a network computer to a corresponding recoverycomputer.
 26. A method according to claim 24 or 25, wherein the emulating step further comprises identifying, for a network computer, changes between a current state of the data environment and a previously stored version of the data environment at the virtual machine, notifying the virtual machine of those changes and updating the previously stored version of the data environment with the changes.
 27. A method accordingto any of claims 24 to 26, wherein the detecting step comprisesmonitoring the reception of a regular signal from each of the networkcomputers and generating an alert signifying a fault condition if anexpected signal is not received.
 28. A method according to any of claims24 to 27, wherein the activation step comprises activating computerapplications on the virtual machine which were in use on the failednetwork computer.
 29. A method according to any of claims 24 to 28,further comprising maintaining a copy of each of the network computer'sdata environments and in the event of a failure being detected,uploading the copy to a back-up computer.
 30. A method according toclaim 29, further comprising installing the uploaded back-up computer onthe network of the failed network computer; identifying changes betweena current state of the data environment of the corresponding virtualmachine and the data environment stored on the back-up computer, andupdating the previously stored version of the data environment on theback-up computer with the changes.
 31. A method according to claim 30,further comprising activating the back-up computer to assume theidentity of the failed network computer, and replace the same; andrelinquishing the virtual machine from its back up operation for thefailed network computer.
 32. A method according to any of claims 24 to 31, further comprising storing multiple copies of a network computer's data environment in independent virtual machines, each copy being created at a different time to provide a plurality of versions of the network computer in time, wherein the attaching step includes selecting the most appropriate virtual machine version of the network computer and attaching that virtual machine to the network.
 33. A method ofrapidly creating a replacement network computer for restoring theoperation of a plurality of network computers in the event of a detectedfault condition, the method comprising: emulating the networkedcomputers on a corresponding plurality of virtual machines installed ona remotely located recovery computer, the emulation including thecreation and storage of an image of the data environment of each of thenetworked computers; detecting a failure of a networked computer;selecting a remotely located back-up computer; transferring thecorresponding image of the failed network computer onto the remotelylocated back-up computer; and activating and using the back-up computerto restore the operation of the plurality of network computers.
 34. Amethod of carrying out system management activities on one of aplurality of network computers, the method comprising: emulating thenetworked computers on a corresponding plurality of virtual machinesinstalled on a remotely located recovery computer, the emulationincluding the creation and storage of an image of the data environmentof each of the networked computers; selecting a remotely located back-upcomputer; transferring the corresponding image of the network computeronto the remotely located back-up computer whilst maintaining networkedcomputer operation; attaching the back-up computer to a test network andactivating the same; and carrying out testing of the back up computer toeffect the system management activities.
 35. A method of providing a recovery system for sustaining the operation of a networked computer in the event of a failure condition, the method comprising: emulating a networked computer on a first virtual machine and a second virtual machine installed on a recovery computer, the first and second virtual machines containing images of the data environment of the networked computer taken at different time periods; and selecting, in view of a failure of the networked computer, the virtual machine representing the most appropriate time period to replace the failed networked computer.
 36. A method according to claim 35, wherein the emulating step comprises emulating the networked computer on one or more further virtual machines installed on the recovery computer, each further virtual machine containing an image of the data environment of the networked computer taken at a time different to that of the other virtual machines.
 37. Amethod according to claim 35 or 36, wherein the emulating step comprisescopying an image from the first virtual machine to the second virtualmachine, and updating the second virtual machine with changes in thedata environment of the corresponding network computer at a later pointin time.
 38. A method according to any of claims 35 to 37, wherein theselecting step comprises analysing each virtual machine to determine thepresence of data corruption and selecting the most recently updatedvirtual machine which is free from data corruption.